Evaluation: Use Config Management #477
Conversation
📝 Walkthrough: Replaces embedded evaluation config dicts with stored references (config_id, config_version) that are resolved from config management when an evaluation run starts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/API
    participant API as Evaluations API
    participant Service as Evaluations Service
    participant DB as Config DB
    participant Resolver as Config Blob Resolver
    participant Batch as Batch Orchestrator

    Client->>API: POST /evaluations (dataset_id, config_id, config_version, ...)
    API->>Service: start_evaluation(dataset_id, config_id, config_version, ...)
    Service->>DB: ConfigVersionCrud.get_by_id(config_id, config_version)
    DB-->>Service: ConfigVersion record (with blob)
    Service->>Resolver: resolve_config_blob(config_version.blob)
    Resolver-->>Service: LLMCallConfig (completion.params, model)
    Service->>Service: validate provider == OPENAI
    Service->>Batch: start_evaluation_batch(resolved_completion_params, model)
    Batch-->>Service: batch run created / status
    Service-->>API: 201 Created (EvaluationRun with config_id/config_version)
    API-->>Client: 201 Created (response)
```
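To make the new request shape concrete, here is a rough client-side sketch; the endpoint path, host, auth header, and payload values are assumptions inferred from the diagram above, not the actual API contract.

```python
# Hypothetical request sketch (field names follow the sequence diagram;
# the URL prefix and auth header are assumptions).
import httpx

payload = {
    "dataset_id": 42,                                      # existing evaluation dataset
    "config_id": "0b8f6a4e-1d2c-4e5f-9a7b-3c4d5e6f7a8b",   # stored config reference
    "config_version": 1,                                   # pinned config version
    "run_name": "nightly-eval",
}

response = httpx.post(
    "http://localhost:8000/api/v1/evaluations",
    json=payload,
    headers={"X-API-KEY": "your-api-key"},
)
# Expect 201 Created with config_id/config_version echoed back on the run
print(response.status_code, response.json())
```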
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)
366-367: Misleading comment - update to reflect actual behavior. The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.
```diff
-# Get embedding model from config (default: text-embedding-3-large)
-embedding_model = "text-embedding-3-large"
+# Use fixed embedding model (text-embedding-3-large)
+embedding_model = "text-embedding-3-large"
```

backend/app/tests/api/routes/test_evaluation.py (1)
524-545: Consider renaming function to match its new purpose. The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

```diff
-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
```

backend/app/api/routes/evaluation.py (1)
499-510: Consider validating that model is present in config params. The model is extracted with .get("model"), which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

```diff
     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
```

backend/app/crud/evaluations/core.py (1)
15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability. The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

- The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
- The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
- All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging: include model in the creation log for better traceability when correlating evaluation runs with model versions:

```diff
     logger.info(
         f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-        f"config_id={config_id}, config_version={config_version}"
+        f"config_id={config_id}, config_version={config_version}, model={model}"
     )
```

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
- backend/app/api/routes/evaluation.py (5 hunks)
- backend/app/crud/evaluations/core.py (5 hunks)
- backend/app/crud/evaluations/embeddings.py (1 hunks)
- backend/app/crud/evaluations/processing.py (1 hunks)
- backend/app/models/evaluation.py (3 hunks)
- backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
- backend/app/api/routes/evaluation.py
- backend/app/models/evaluation.py
- backend/app/crud/evaluations/embeddings.py
- backend/app/tests/api/routes/test_evaluation.py
- backend/app/crud/evaluations/processing.py
- backend/app/crud/evaluations/core.py
- backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/evaluation.py
backend/app/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define SQLModel entities (database tables and domain objects) in backend/app/models/
Files:
backend/app/models/evaluation.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
- backend/app/crud/evaluations/embeddings.py
- backend/app/crud/evaluations/processing.py
- backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
- backend/app/crud/evaluations/batch.py (1): build_evaluation_jsonl (62-115)
- backend/app/models/evaluation.py (2): EvaluationDataset (74-130), EvaluationRun (133-248)

backend/app/crud/evaluations/processing.py (1)
- backend/app/crud/evaluations/langfuse.py (1): create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)
257-264: LGTM! Clean refactor to use stored model field. The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.
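As a minimal illustration of the pattern (a sketch only; the real logic lives in backend/app/crud/evaluations/processing.py, and the import path and guard behavior below are assumptions):

```python
from app.models.evaluation import EvaluationRun  # import path assumed


def resolve_run_model(eval_run: EvaluationRun) -> str:
    """Return the model snapshotted on the run (sketch, not the actual helper)."""
    # Before this PR: model = eval_run.config.get("model")  # dug out of the JSONB blob
    if not eval_run.model:
        # Hypothetical guard; how missing values are handled in the real code is not shown here
        raise ValueError(f"Evaluation run {eval_run.id} has no model recorded")
    return eval_run.model
```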
backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields. The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)
478-495: LGTM! Robust config resolution with provider validation. The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.
backend/app/alembic/versions/040_add_config_in_evals_run_table.py (outdated; comment resolved)
Codecov Report: ❌ Patch coverage is …
Actionable comments posted: 0
🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)
148-158: Align EvaluationRun type hints with nullable DB columns for config fields. config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they're always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

```diff
-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )
```

This keeps the schema the same while making runtime and type expectations clearer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/models/evaluation.py
backend/app/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define SQLModel entities (database tables and domain objects) in backend/app/models/
Files:
backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)
271-273: Public model nullability now matches the schema. Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.
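For reference, a sketch of what the nullable public fields look like; the surrounding fields and defaults here are illustrative assumptions, and only the nullability of config_id, config_version, and model is confirmed by the comment above.

```python
from uuid import UUID

from sqlmodel import SQLModel


class EvaluationRunPublic(SQLModel):
    # Illustrative subset — the real public model carries more fields.
    id: int
    run_name: str
    config_id: UUID | None = None       # None for legacy runs created before this change
    config_version: int | None = None
    model: str | None = None            # snapshotted at creation time when available
```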
backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (outdated; comment resolved)
Force-pushed f5b94b0 to cdb0b2e
…ssistant_id handling
…nstead of config dict
…ig_version fields
…ve_model_from_config function, and update processing logic to use config references
Force-pushed c9cc51a to a2c8a95
Prajna1999
left a comment
lgtm
hold merge until frontend is ready.

good to go. Can be merged.
…sion - Resolved merge conflicts with main branch (ef56025 refactor) - Updated evaluation routes and services to use config_id/config_version instead of inline config dict - Deleted old evaluation.py route (replaced by new evaluations/ directory) - Updated tests to use config_id/config_version pattern - Kept resolve_model_from_config function for cron job processing Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 66-69: The logger.info call that reports the created EvaluationRun
(the one referencing eval_run.id, run_name, config_id, config_version) must be
prefixed with the function name tag; update the log message in
create_evaluation_run to start with "[create_evaluation_run]" so the entry reads
like "[create_evaluation_run] Created EvaluationRun record: id=...,
run_name=..., config_id=..., config_version=...". Ensure you only change the log
string (the logger.info call) and keep the existing variables (eval_run,
run_name, config_id, config_version) intact.
♻️ Duplicate comments (3)
backend/app/services/llm/providers/registry.py (1)
16-24: Handle the new OPENAI alias in get_llm_provider. Lines 16-24 register "openai", but lines 66-74 only accept OPENAI_NATIVE, so provider_type="openai" will raise ValueError despite being registered. This makes the new provider unusable.

🔧 Proposed fix

```diff
-    if provider_type == LLMProvider.OPENAI_NATIVE:
+    if provider_type in (LLMProvider.OPENAI_NATIVE, LLMProvider.OPENAI):
         if "api_key" not in credentials:
             raise ValueError("OpenAI credentials not configured for this project.")
         client = OpenAI(api_key=credentials["api_key"])
```

Also applies to: 66-74
backend/app/crud/evaluations/batch.py (1)
107-119: Avoid sending None fields in the Responses API payload. Lines 107-119 always include optional fields and always build tools, so vector_store_ids=None is sent when no KB is configured. This can cause request failures. Build the body conditionally and omit None values.

🔧 Proposed fix (conditional body)

```diff
-        batch_request = {
-            "custom_id": item["id"],
-            "method": "POST",
-            "url": "/v1/responses",
-            "body": {
-                # Use config as-is
-                "model": config.model,
-                "instructions": config.instructions,
-                "temperature": config.temperature,
-                "reasoning": {"effort": config.reasoning} if config.reasoning else None,
-                "tools": [
-                    {
-                        "type": "file_search",
-                        "vector_store_ids": config.knowledge_base_ids,
-                        "max_num_results": config.max_num_results or 20,
-                    }
-                ],
-                "input": question,  # Add input from dataset
-            },
-        }
+        body: dict[str, Any] = {"model": config.model, "input": question}
+        if config.instructions is not None:
+            body["instructions"] = config.instructions
+        if config.temperature is not None:
+            body["temperature"] = config.temperature
+        if config.reasoning is not None:
+            body["reasoning"] = {"effort": config.reasoning}
+        if config.knowledge_base_ids:
+            body["tools"] = [
+                {
+                    "type": "file_search",
+                    "vector_store_ids": config.knowledge_base_ids,
+                    "max_num_results": config.max_num_results or 20,
+                }
+            ]
+
+        batch_request = {
+            "custom_id": item["id"],
+            "method": "POST",
+            "url": "/v1/responses",
+            "body": body,
+        }
```

OpenAI Responses API: are optional fields allowed to be null, and must tools.vector_store_ids be a non-null list?

backend/app/crud/evaluations/core.py (1)
343-366: Validate completion params and model before returning. Line 366 dereferences config.completion.params.model without checking if completion/params exist or if model is empty, which can raise or violate the declared str return contract.

🔧 Proposed fix

```diff
-    model = config.completion.params.model
-    return model
+    if not config.completion or not config.completion.params:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} is missing completion params "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
+
+    model = config.completion.params.model
+    if not model:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} has no model specified "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
+    return model
```
🧹 Nitpick comments (2)
backend/app/services/evaluations/evaluation.py (1)
101-114: Add explicit None check for type safety. After resolve_config_blob returns, the type of config is ConfigBlob | None. While the function contract implies config is non-None when error is None, accessing config.completion.provider on line 110 without an explicit check could cause issues with type checkers and is less defensive.

🛠️ Suggested improvement

```diff
     config, error = resolve_config_blob(
         config_crud=config_version_crud,
         config=LLMCallConfig(id=config_id, version=config_version),
     )
-    if error:
+    if error or config is None:
         raise HTTPException(
             status_code=400,
             detail=f"Failed to resolve config from stored config: {error}",
         )
-    elif config.completion.provider != LLMProvider.OPENAI:
+
+    if config.completion.provider != LLMProvider.OPENAI:
         raise HTTPException(
             status_code=422,
             detail="Only 'openai' provider is supported for evaluation configs",
         )
```

backend/app/tests/api/routes/test_evaluation.py (1)
554-577: Test name is inconsistent with test behavior after refactor. The test method is named test_start_batch_evaluation_missing_model but the docstring and actual test behavior validate that the evaluation fails with an invalid/non-existent config_id. Consider renaming for clarity.

🛠️ Suggested rename

```diff
-    def test_start_batch_evaluation_missing_model(
+    def test_start_batch_evaluation_invalid_config_id(
         self, client: TestClient, user_api_key_header: dict[str, str]
     ) -> None:
         """Test batch evaluation fails with invalid config_id."""
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
backend/app/tests/crud/evaluations/test_processing.py (6)
260-291: Add a type hint for test_dataset. test_dataset is untyped in this fixture. The guidelines require type hints on all params/returns. As per coding guidelines, please annotate the fixture parameter.

🔧 Proposed fix

```diff
-    def eval_run_with_batch(self, db: Session, test_dataset) -> EvaluationRun:
+    def eval_run_with_batch(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> EvaluationRun:
```
402-418: Add missing type hints and return type. This async test lacks a type hint for test_dataset and a return type. As per coding guidelines, please annotate both.

🔧 Proposed fix

```diff
-    async def test_process_completed_evaluation_no_batch_job_id(
-        self, db: Session, test_dataset
-    ):
+    async def test_process_completed_evaluation_no_batch_job_id(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> None:
```
459-491: Add a type hint for test_dataset. This fixture parameter is untyped. As per coding guidelines, please add the type hint.

🔧 Proposed fix

```diff
-    def eval_run_with_embedding_batch(self, db: Session, test_dataset) -> EvaluationRun:
+    def eval_run_with_embedding_batch(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> EvaluationRun:
```
606-648: Add missing type hints and return type. test_dataset is untyped and the return type is missing. As per coding guidelines, please annotate both.

🔧 Proposed fix

```diff
     async def test_check_and_process_evaluation_completed(
         self,
         mock_process,
         mock_poll,
         mock_get_batch,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:
```
675-717: Add missing type hints and return type. test_dataset is untyped and the return type is missing. As per coding guidelines, please annotate both.

🔧 Proposed fix

```diff
     async def test_check_and_process_evaluation_failed(
         self,
         mock_poll,
         mock_get_batch,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:
```
779-822: Add missing type hints and return type. test_dataset is untyped and the return type is missing. As per coding guidelines, please annotate both.

🔧 Proposed fix

```diff
     async def test_poll_all_pending_evaluations_with_runs(
         self,
         mock_langfuse_client,
         mock_openai_client,
         mock_check,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:
```
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)
555-578: Consider renaming the test for clarity. The test now validates an invalid config_id, but the name still says "missing_model". Renaming improves intent clarity.

♻️ Suggested rename

```diff
-def test_start_batch_evaluation_missing_model(
+def test_start_batch_evaluation_invalid_config_id(
     self, client: TestClient, user_api_key_header: dict[str, str]
 ) -> None:
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/embeddings.py`:
- Around line 368-369: The comment above the assignment to embedding_model is
misleading—change it to state that the model is a module-level constant rather
than being read from configuration; update the comment referencing
EMBEDDING_MODEL (used when assigning embedding_model = EMBEDDING_MODEL) to
something like "Use module-level EMBEDDING_MODEL constant" so it accurately
describes the behavior.
🧹 Nitpick comments (1)
backend/app/crud/evaluations/embeddings.py (1)
24-26: Consider adding an explicit type annotation to the constant. Per the coding guidelines recommending type hints throughout the codebase, adding an explicit type annotation improves clarity and tooling support.

Suggested change

```diff
 # Default embedding model
-EMBEDDING_MODEL = "text-embedding-3-large"
+EMBEDDING_MODEL: str = "text-embedding-3-large"
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/app/services/evaluations/evaluation.py`:
- Around line 94-111: The current flow assumes resolve_evaluation_config
returned a fully populated object and dereferences config.completion.provider,
which can be None/missing and cause AttributeError; update the handling after
resolve_evaluation_config (the block that checks error and then inspects
config.completion.provider) to first verify config is not None and that config
has a completion attribute with the expected fields (e.g., config.completion and
config.completion.provider and config.completion.params) before accessing
provider; if any are missing, raise an HTTPException with a 400 (or 422 as
appropriate) and a clear detail string like "Incomplete evaluation config:
missing completion/provider/params" so unresolved/partial configs return
controlled 4xx errors instead of a 500.
♻️ Duplicate comments (1)
backend/app/crud/evaluations/core.py (1)
372-392: Validate completion params/model before returning. config.completion.params.model can be missing or None, which violates the declared str return type and can break downstream calls. This mirrors a previously flagged issue.

🔧 Proposed fix

```diff
-    if not eval_run.config_id or not eval_run.config_version:
+    if eval_run.config_id is None or eval_run.config_version is None:
         raise ValueError(
             f"Evaluation run {eval_run.id} has no config reference "
             f"(config_id={eval_run.config_id}, config_version={eval_run.config_version})"
         )
 @@
-    model = config.completion.params.model
-    return model
+    if not config.completion or not config.completion.params:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} is missing completion params"
+        )
+
+    model = config.completion.params.model
+    if not model:
+        raise ValueError(
+            f"Config for evaluation {eval_run.id} does not specify a model "
+            f"(config_id={eval_run.config_id}, version={eval_run.config_version})"
+        )
+    return model
```
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)
529-544: Prefer deriving config_version from the created config. Hardcoding config_version=1 (e.g., Line 543, Line 767, Line 807, Line 868, Line 906) can drift if create_test_config changes its default versioning. Consider using the version exposed by the created config object to keep tests resilient.

Also applies to: 759-768, 799-808, 862-868, 899-906
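A sketch of the suggested pattern (the create_test_config helper signature, route prefix, and payload keys are assumptions based on the tests referenced above):

```python
# Hypothetical test fragment.
config = create_test_config(db)

payload = {
    "dataset_id": dataset.id,
    "config_id": str(config.id),
    "config_version": config.version,  # derived from the created config instead of hardcoding 1
}
response = client.post(
    "/api/v1/evaluations", json=payload, headers=user_api_key_header
)
assert response.status_code == 201
```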
```python
# Step 2: Resolve config from stored config management
config, error = resolve_evaluation_config(
    session=session,
    config=config,
    assistant_id=assistant_id,
    config_id=config_id,
    config_version=config_version,
    project_id=project_id,
)
if error:
    raise HTTPException(
        status_code=400,
        detail=f"Failed to resolve config from stored config: {error}",
    )
elif config.completion.provider != LLMProvider.OPENAI:
    raise HTTPException(
        status_code=422,
        detail="Only 'openai' provider is supported for evaluation configs",
    )
```
Guard against unresolved/partial config before dereferencing.
resolve_evaluation_config can return config=None or a blob missing completion/params. The current flow dereferences those fields, which can crash with AttributeError and return a 500 instead of a controlled 4xx.
🔧 Proposed fix

```diff
-    config, error = resolve_evaluation_config(
+    config, error = resolve_evaluation_config(
         session=session,
         config_id=config_id,
         config_version=config_version,
         project_id=project_id,
     )
-    if error:
+    if error or config is None:
         raise HTTPException(
             status_code=400,
             detail=f"Failed to resolve config from stored config: {error}",
         )
-    elif config.completion.provider != LLMProvider.OPENAI:
+    if not config.completion or not config.completion.params:
+        raise HTTPException(
+            status_code=422,
+            detail="Config is missing completion parameters required for evaluation",
+        )
+    if config.completion.provider != LLMProvider.OPENAI:
         raise HTTPException(
             status_code=422,
             detail="Only 'openai' provider is supported for evaluation configs",
         )
```
In `@backend/app/services/evaluations/evaluation.py` around lines 94 - 111, The
current flow assumes resolve_evaluation_config returned a fully populated object
and dereferences config.completion.provider, which can be None/missing and cause
AttributeError; update the handling after resolve_evaluation_config (the block
that checks error and then inspects config.completion.provider) to first verify
config is not None and that config has a completion attribute with the expected
fields (e.g., config.completion and config.completion.provider and
config.completion.params) before accessing provider; if any are missing, raise
an HTTPException with a 400 (or 422 as appropriate) and a clear detail string
like "Incomplete evaluation config: missing completion/provider/params" so
unresolved/partial configs return controlled 4xx errors instead of a 500.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/041_add_config_in_evals_run_table.py`:
- Line 19: Add Python type hints to the migration functions: update the
signature of upgrade to include a return type (e.g., change def upgrade() to def
upgrade() -> None) and do the same for the corresponding downgrade function (def
downgrade() -> None), ensuring both functions have explicit type hints for
parameters (none) and return values to satisfy the codebase typing guidelines.
♻️ Duplicate comments (2)
backend/app/alembic/versions/041_add_config_in_evals_run_table.py (2)
19-41: Prevent data loss when dropping evaluation_run.config. Line 41 drops the old JSONB config without migrating existing data into config_id/config_version, which will permanently lose configuration data for existing rows. Please add a data migration/backfill (or explicitly document why data loss is acceptable) before dropping the column.
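One possible shape for such a step, as a sketch only; the archive table name is hypothetical, and this assumes the old blob cannot be mapped directly onto config_id/config_version, so it is simply preserved aside before the drop:

```python
# Hypothetical upgrade fragment: park old JSONB blobs in a side table before
# dropping the column, so historical runs remain reconstructable.
from alembic import op


def upgrade() -> None:
    op.execute(
        """
        CREATE TABLE IF NOT EXISTS evaluation_run_config_archive AS
        SELECT id AS evaluation_run_id, config
        FROM evaluation_run
        WHERE config IS NOT NULL
        """
    )
    op.drop_column("evaluation_run", "config")
```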
45-53: Downgrade will fail if rows exist due to non-nullable config. op.add_column(..., nullable=False) will fail on a non-empty table. Make it nullable (or add a default/backfill before enforcing NOT NULL).

✅ Suggested fix

```diff
     op.add_column(
         "evaluation_run",
         sa.Column(
             "config",
             postgresql.JSONB(astext_type=sa.Text()),
             autoincrement=False,
-            nullable=False,
+            nullable=True,
             comment="Evaluation configuration (model, instructions, etc.)",
         ),
     )
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/batch.py`:
- Around line 103-110: The body dict currently always sets "instructions":
config.instructions which can send a None value; update the code that constructs
the request body (the body dict) to only add the "instructions" key when
config.instructions is not None, following the same conditional pattern used for
"reasoning" and "tools" (i.e., check config.instructions and only include the
field when present) so that the API never receives "instructions": None.
🧹 Nitpick comments (2)
backend/app/crud/evaluations/batch.py (2)
117-117: Redundant length check. The expression config.knowledge_base_ids and len(config.knowledge_base_ids) > 0 is redundant. A non-empty list is truthy, so if config.knowledge_base_ids: suffices.

♻️ Proposed simplification

```diff
-        if config.knowledge_base_ids and len(config.knowledge_base_ids) > 0:
+        if config.knowledge_base_ids:
```
154-156: Update docstring to reflect the new type. The docstring describes config as a "dict with llm, instructions, vector_store_ids" but it is now KaapiLLMParams with fields model, instructions, knowledge_base_ids, etc.

📝 Proposed docstring update

```diff
-        config: Evaluation configuration dict with llm, instructions, vector_store_ids
+        config: KaapiLLMParams with model, instructions, knowledge_base_ids, etc.
```
kartpop
left a comment
approved with comments
```diff
 def build_evaluation_jsonl(
-    dataset_items: list[dict[str, Any]], config: dict[str, Any]
+    dataset_items: list[dict[str, Any]], config: KaapiLLMParams
```
To make sure Text, TTS and STT config works, KaapiLLMParams looks like this in the STT PR
```python
from typing import Literal, Union

from sqlmodel import Field, SQLModel


class TextLLMParams(SQLModel):
    model: str
    instructions: str | None = Field(
        default=None,
    )
    knowledge_base_ids: list[str] | None = Field(
        default=None,
        description="List of vector store IDs to use for knowledge retrieval",
    )
    reasoning: Literal["low", "medium", "high"] | None = Field(
        default=None,
        description="Reasoning configuration or instructions",
    )
    temperature: float | None = Field(
        default=None,
        ge=0.0,
        le=2.0,
    )
    max_num_results: int | None = Field(
        default=None,
        ge=1,
        description="Maximum number of candidate results to return",
    )


class STTLLMParams(SQLModel):
    model: str
    instructions: str
    input_language: str | None = None
    output_language: str | None = None
    response_format: Literal["text"] | None = Field(
        None,
        description="Can take multiple response_format like text, json, verbose_json.",
    )
    temperature: float | None = Field(
        default=0.2,
        ge=0.0,
        le=2.0,
    )


class TTSLLMParams(SQLModel):
    model: str
    voice: str
    language: str
    response_format: Literal["mp3", "wav", "ogg"] | None = "wav"
    speed: float | None = Field(None, ge=0.25, le=4.0)


KaapiLLMParams = Union[TextLLMParams, STTLLMParams, TTSLLMParams]
```
KaapiLLMParams is a union of all these related pydantic models. Maybe we can use the Any type to not force type safety for now, and plan accordingly once the STT PR gets merged.
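Until that lands, one hedged option would be to narrow the union to the text variant at the top of the batch builder. The request-body keys below follow the diff earlier in this review; the return shape and the "question" key on dataset items are assumptions.

```python
from typing import Any


# Sketch only — narrows KaapiLLMParams (defined above) to TextLLMParams before
# building batch request lines; non-text configs are rejected explicitly.
def build_evaluation_jsonl(
    dataset_items: list[dict[str, Any]], config: KaapiLLMParams
) -> list[dict[str, Any]]:
    if not isinstance(config, TextLLMParams):
        raise ValueError(
            f"Evaluation batches only support text configs, got {type(config).__name__}"
        )

    requests: list[dict[str, Any]] = []
    for item in dataset_items:
        body: dict[str, Any] = {"model": config.model, "input": item["question"]}
        if config.instructions is not None:
            body["instructions"] = config.instructions
        requests.append(
            {"custom_id": item["id"], "method": "POST", "url": "/v1/responses", "body": body}
        )
    return requests
```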
```diff
     session: Session,
     eval_run: EvaluationRun,
-    config: dict[str, Any],
+    config: KaapiLLMParams,
```
ditto as above
```diff
 class LLMProvider:
     OPENAI_NATIVE = "openai-native"
+    OPENAI = "openai"
```
This LLMProvider by design takes only OPENAI_NATIVE and GOOGLE_NATIVE for an effective flow from the top. Once we add OPENAI as well, it will create issues with the unified API. Let's remove this enum for now. Once the new STT PR gets merged, this extra field will probably not be required.
Will take this up in the next PR.
Summary
Target issue is #557 & #547
This change refactors the evaluation run process to use a stored configuration instead of a configuration dictionary. It introduces fields for config_id, config_version, and model in the evaluation run table, streamlining the evaluation process and improving data integrity.
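For illustration, the call-site shape this enables looks roughly like the following; the parameter names come from the review comments above, while the full signature and surrounding code are assumptions.

```python
# Hypothetical call-site sketch, not the actual route handler.
eval_run = create_evaluation_run(
    session=session,
    run_name="nightly-eval",
    dataset_id=dataset.id,
    config_id=config_id,            # UUID of the stored config
    config_version=config_version,  # pinned version (>= 1)
    model=model,                    # extracted from the resolved config for cost tracking
)
```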
Checklist

Before submitting a pull request, please ensure that you mark these tasks.
Run fastapi run --reload app/main.py or docker compose up in the repository root and test.